Enhancing Hi-C contact matrices for loop detection with Capricorn: a multiview diffusion model

Abstract Motivation High-resolution Hi-C contact matrices reveal the detailed three-dimensional architecture of the genome, but high-coverage experimental Hi-C data are expensive to generate. Simultaneously, chromatin structure analyses struggle with extremely sparse contact matrices. To address this problem, computational methods to enhance low-coverage contact matrices have been developed, but existing methods are largely based on resolution enhancement methods for natural images and hence often employ models that do not distinguish between biologically meaningful contacts, such as loops and other stochastic contacts. Results We present Capricorn, a machine learning model for Hi-C resolution enhancement that incorporates small-scale chromatin features as additional views of the input Hi-C contact matrix and leverages a diffusion probability model backbone to generate a high-coverage matrix. We show that Capricorn outperforms the state of the art in a cross-cell-line setting, improving on existing methods by 17% in mean squared error and 26% in F1 score for chromatin loop identification from the generated high-coverage data. We also demonstrate that Capricorn performs well in the cross-chromosome setting and cross-chromosome, cross-cell-line setting, improving the downstream loop F1 score by 14% relative to existing methods. We further show that our multiview idea can also be used to improve several existing methods, HiCARN and HiCNN, indicating the wide applicability of this approach. Finally, we use DNA sequence to validate discovered loops and find that the fraction of CTCF-supported loops from Capricorn is similar to those identified from the high-coverage data. Capricorn is a powerful Hi-C resolution enhancement method that enables scientists to find chromatin features that cannot be identified in the low-coverage contact matrix. Availability and implementation Implementation of Capricorn and source code for reproducing all figures in this paper are available at https://github.com/CHNFTQ/Capricorn.


Cell line
Transcription factor ENCODE accession code Table S1: Accession codes for the ChIP-seq data used in the CTCF loop validation analysis following the protocol described in Rao et al. [2].

S1 Dataset accessibility S1.1 Hi-C experimental data
We collect the GM12878 and K562 cell line Hi-C experimental data from the Gene Expression Omnibus (GEO) database [1] with accession code GSE63525.

S1.2 CTCF validation ChIP-seq data
For the CTCF loop validation following Rao et al. [2] we use the ENCODE data portal [3] to access several ChIP-seq datasets.We provide the accession codes for these experiments in Table S1.In our secondary CTCF validation conducted following the protocol in Roayaei Ardakany et al. [4] and provided in Supplementary Section S9.3, we use the GEO accession code GSM1872886 to obtain CTCF ChIP-seq data.

S2 Additional biological views in Capricorn
Here, we provide more precise definitions of how Capricorn's additional biological views are computed.

S2.1 Distance-corrected view
Capricorn includes a view that accounts for the fact that nearby loci are more likely to interact by chance than more distal loci, such as the strong number of contacts observed along the diagonal of an uncorrected Hi-C matrix.Therefore, one common pre-processing step for Hi-C analyses involves correcting for this proximity bias [5][6][7].
Consider the input, low-coverage Hi-C contact matrix X ∈ N l×l , where l = ⌊ L ∆ ⌋ indicates the number of binned genomic loci in the matrix.The distance-correction module D : X → X (oe) is defined as follows.First, compute a distance-based expected matrix by taking the contacts for all pairs of loci (i, j) that are |i − j| apart in the genome.In practice, this is computed by taking the average along each diagonal of the contact matrix as This can be repeated for all i ∈ [1, . . ., n] and j ∈ [1, . . ., n] to produce E(X) ∈ R l×l .After observing that most distance-expected values occurred in the range [0, 16], we further clamp and normalize a distance-corrected version of the input matrix as 16 16 otherwise.

(S2)
We follow the same steps to compute E(Y) and Y (oe) given the high-coverage contact matrix Y.

S2.2 Chromatin loops view
We include two additional views based on chromatin loop structure identified from the input contact matrices.We use the Hi-C Computational Unbiased Peak Search (HiCCUPS) loop calling tool [2].HiCCUPS tests whether the measured contacts for possible (i, j) loop anchor points are significantly more enriched than the surrounding contacts.Specifically, the method combines a 10 × 10 donut kernel centered at a specific locus with horizontal, vertical, and lower-left quadrant kernels as described in Rao et al. [2].For a given locus (i, j), we use the distance-based expected matrix as defined in (S1) to compute for each kernel κ p = {κ 1 , . . ., κ 4 }.Then we compute the loop ratio for the input matrix as We follow the HiCCUPS algorithm and fit a Poisson distribution to the observed measurements with λchunking, where we set λ = 2 1/3 as in the default HiCCUPS settings [2].In the λ-chunking protocol, we define µ i,j = max c λ c subject to αp βp E(X)[i, j] < λ c , grouping pixels into relative intensity ranges of {[0, 1), [1, 2 1/3 ), [2 1/3 , 2 2/3 ), . . .}. Finally, we use the result of λ-chunking to compute the loop p-value as where f (• ≥ t|µ) represents the survival function of the Poisson distribution with expected value µ, and BH µi,j indicates the Benjamini-Hochberg false discovery rate correction [8,9] over the p-values for all pixels with the same λ-chunked µ i,j .We compute Y (loop−r) and Y (loop−p) based on the ground-truth high-coverage matrix Y in an analogous manner.

S2.3 TAD score view
We compute the insulation score (IS) [10] for TAD detection, using sliding windows along the contact matrix diagonal to identify insulated regions with low scores and in-domain regions with high scores.We then smooth the IS scores by averaging them over a 21-bin domain and assign the within-TAD label 1 to contiguous regions with monotonically decreasing ISs centered around some locus (i, j), where all ISs are still larger than the contact matrix average IS and the IS of (i, j) is at least 10% larger than the IS at the region boundary.

S3 Multi-view weighting details
Given the input, multi-view low-coverage contact matrix representation X(X) ∈ R 5×L/∆×L/∆ , we computed an initial view weight vector ω 0 ∈ R+ 5 as for the vth view, where var(•) denotes variance, Ξ indicates the set of all chromosomes tested for the given cell line, and X(X ξ ) indicates the views computed from the low-coverage cis-contact matrix X for chromosome ξ.
The computed ω 0 for both GM12878 and K562 is reported in Table S2.
We then further refine the weights to account for the difficulty of generating the high-resolution version of each view.Specifically, we compute a view-specific difficulty score using the validation set, as where mse(•, •) indicates the mean squared error, f (•) is the diffusion model backbone, and (X val , Y val ) are the paired low-and high-resolution contact matrices in the validation set.Finally, we compute ω ∈ R+ 5 as We explicitly set the weight of the 0th view, which contains the low-coverage input contact matrix without further alteration, to 1.This avoids altering the basic matrix view that we aim to enhance.The precise weight vector ω is given in Table S2 for both GM12878-based training and K562-based training.During all subsequent training runs, we more precisely compute the input as For simplicity of notation, we refer to this as X(X) and Ỹ(Y) in the manuscript for all model training after the initial weight tuning.

S4 Diffusion models
Diffusion models leverage a latent Markov chain framework that iteratively denoises an input to produce diverse and realistic outputs [11,12].The models are split into a T -step forward process q(y t |y t−1 ) and T -step reverse process p(y t−1 |y t , x), where y 0 ∈ R C×W ×H is the C-channel original, high-fidelity image, y t ∈ R C×W ×H is a latent variable, and x is an optional input term on which to condition the process.Specifically, the forward process computes progressively noisier latent representations of the true image as for t = 1, . . ., T , where β t is defined by a noise schedule, such as a cosine schedule [13,14], and α t = 1 − β t .The reverse diffusion process, optionally conditioned on input x, can be then parameterized by a learnable ϵ θ (•) as where ϵ θ (•) is trained to predict noise from the previous step's y t prediction and the conditional input x.We denote a sample from a step of the reverse process as ŷt−1 ∼ p θ (y t−1 |y t , x, t), or ŷt−1 ∼ p θ for brevity of notation.

S4.1 Conditional diffusion models
We focus on conditional diffusion models that are trained with a MSE objective in the image's pixel space, or the bin space of a contact matrix.In this case, we compute the loss where w t is a diffusion loss hyperparameter impacting the the relative amount of sampled Gaussian noise [14].
The expectation over t ∈ {1, . . ., T } reflects how we can compute loss at a different number of steps of the reverse diffusion process, introducing t noise samples to the target y 0 during the forward process before denoising.

S5 Cross-cell-line loop statistics
In addition to the loop F1 score, here we report the total number of loops called and the precision and recall of the called loops, still treating loops called from the high-coverage data as the ground truth.We report results using the HiCCUPS [2], Mustache [4], and Chromosight [15] loop calling tools for evaluation; in all cases, Capricorn's loop-based views were computed using the HiCCUPS algorithm.For reference, the total number of loops called from the original high-coverage Hi-C matrices is provided in Table S3.The performance metrics for all three loop calling algorithms given the low-coverage (LC) data and the enhanced matrices produced by Capricorn (ours), HiCNN [16], HiCARN-1 [17], HiCARN-2 [17], and HiCSR [18] are shown in Table S4.

Cell line
Weight vector

S6 Cross-chromosome loop statistics
In addition to the loop F1 score, here we report the total number of loops called and the precision and recall of the called loops in the cross-chromosome experimental setting, still treating loops called from the high-coverage data as the ground truth.We report results in  S6: Cross-cell-line, cross-chromosome resolution enhancement based on identifiable loops using the HiC-CUPS [2] loop caller.The best-performing resolution enhancement method for each cell line and each metric is bolded.Reported performance numbers are the average performance across 4 test chromosomes.

S7 Cross-cell-line, cross-chromosome loop statistics
In addition to the loop F1 score, here we report the total number of loops called and the precision and recall of the called loops in the cross-cell-line, cross-chromosome experimental setting, still treating loops called from the high-coverage data as the ground truth.We report results in Table S6 using the HiCCUPS [2] loop calling tools for evaluation.Note that there is no comparison with the low-coverage input baseline in this setting, as the low-coverage for the test cell line is not available to the enhancement models during inference.

S8 Capricorn view ablation study
In addition to applying our multi-view framework to the other comparison approaches, we also conducted an ablation study over the views included in Capricorn.Specifically, we considered the 8 different approaches listed in Table S7, still using Capricorn's diffusion model backbone for resolution enhancement.
We then compared the loop F1 score and MSE in the cross-cell-line setting for all ablated methods.Results are given in Table S8.

S9 Additional CTCF validation
We repeat the CTCF validation procedure described in Rao et al. [2] for loops called with other loop calling methods.As in the CTCF validation analysis with HiCCUPS loop calls, we compare the percent of loops with flanking convergent CTCF motifs for high-coverage, low-coverage, and Capricorn-enhanced contact matrices.
In general, we would expect that well-enhanced contact matrices produce loop calls with similar levels of CTCF support to the high-coverage experimental data.However, manual inspection of Mustache [4] and Chromosight Ablation study name X X (oe) (X (loop−p) , X (loop−r) ) X (tad)  HiC [15] loops suggests that both of these loop calling methods struggle more with enhanced matrices in sparse regions, identifying loop peaks where none appear to exist (Fig. S1).

S9.1 CTCF validation for loops called with Mustache
When comparing CTCF support of loops called by Mustache [4], we report both the overall support for all loops identified in the Capricorn-enhanced contact matrices and the support for the k highest-confidence loop calls, where k is selected as the number of loops called from the high-coverage data and confidence is determined by FDR.

S9.2 CTCF validation for loops called with Chromosight
When comparing CTCF support of loops called by Chromosight [15], we report both the overall support for all loops identified in the Capricorn-enhanced contact matrices and the support for the k highest-confidence loop calls, where k is selected as the number of loops called from the high-coverage data and confidence is determined by Chromosight's loop score.

S9.3 Secondary CTCF validation
In addition to the CTCF-motif-based loop validation following the protocol of Rao et al. [2], here we also provide a CTCF-based validation following the steps in Roayaei Ardakany et al. [4] for GM12878.We downloaded a   S10: Analysis of CTCF validation for loops called with Chromosight.We show the total number of called loops, loops also supported by flanking convergent CTCF motifs in the DNA sequence, and percentage of loops that were validated.Top-k indicates that the analysis was restricted to the k loops that Chromosight called most confidently.
GM12878 CTCF ChIA-PET [19] experiment1 from GEO.The ChIA-PET experimental data identified 92,807 total peaks, indicating CTCF binding sites along the genome.
In this analysis, we considered a called loop to be "CTCF validated" if there was a CTCF ChIA-PET peak within 10 kb of the loop locus.Loops from the high-coverage, low-coverage, and Capricorn-enhanced contact matrices were called with HiCCUPS [2].Results are shown in Table S11

Figure S1 :
Figure S1: Sample loop calls across resolution enhancement methods and loop callers.a-c, Comparison of ground-truth and computationally-enhanced submatrices covering genomic loci from 47.3 Mb to 48.1 Mb on GM12878 chromosome 17.Blue circles indicated called loops in the ground-truth or predicted high-coverage data using the HiCCUPS (a), Mustache (b), and Chromosight (c) loop calling tools.Circles for loops that are also called in the ground truth data are filled in; loops that are called from the generated data but not the ground-truth data are empty circles.

Table S2 :
Capricorn's computed weights for each view when training on either GM12878 or K562.The initial, distribution-based weights ω 0 and further refined, difficulty-based weights ω are reported.For training the final model, we use the ω weight vector.

Table S3 :
Number of loops called from the original high-coverage data across loop calling methods and cell lines.

Table S5 :
[2]leS5using the HiCCUPS[2]loop calling tools for evaluation.Cross-chromosome resolution enhancement based on identifiable loops using the HiCCUPS [2] loop caller."LC" indicates the low-coverage contact map provided as input to the resolution enhancement methods.The best-performing resolution enhancement method for each cell line and each metric is bolded.Reported performance numbers are the average performance across 4 test chromosomes.

Table S7 :
Ablation studies conducted over Capricorn's multi-view framework.We considered combining the original Hi-C contact matrices with only one additional source of information, as well the the full Capricorn framework missing only one source of information.

Table S8 :
[8]parison of different view ablation study results in the Capricorn framework for both the GM12878 and K562 cell lines.Experiments are conducted in the cross-cell-line setting.The average chromosome loop F1 score using the HiCCUPS loop calling algorithm is shown.*indicatesthat the result is significantly worse than the complete Capricorn framework (full; Wilcoxon signed-rank test with Benjamini-Hochberg[8]correction with a controlled FDR of 0.1).The best-performing Capricorn variant is shown in bold.

Table S9 :
Analysis of CTCF validation for loops called with Mustache.We show the total number of called loops, loops also supported by flanking convergent CTCF motifs in the DNA sequence, and percentage of loops that were validated.Top-k indicates that the analysis was restricted to the k loops that Mustache called most confidently.